Abstract



Recently, it has become increasingly important for computer security researchers and incident
investigators to have access to larger and more diverse data sets. At the same time, pressure to
protect customer privacy has grown as a result of several embarrassing releases of supposedly
anonymous information that was later traced back to individual computer users [7] [15]. This has further
increased the reluctance of data owners to release large data sets to the research community or to share logs
relevant to attacks from a common threat. The burgeoning field of data sanitization has helped alleviate
some of these problems by providing many new tools for anonymizing sensitive data, but
there is still a difficult trade-off to be negotiated between the data owner's need for privacy and security
and the analyst's need for high-utility data. Data sanitization policies must be secure enough for the
first party without losing so much information that the data become unusable to the second.

A prerequisite for negotiating data sanitization policies is the ability to analyze
the effects of anonymization on both the security of the sanitized data and the utility left after
anonymization. In this paper, we focus on analyzing the utility of network traces post-anonymization. Of
course, any such measure of utility depends on the type of analysis being performed,
so this work scopes the problem to utility for the task of attack detection. We employ a methodology
we developed that analyzes the effect of anonymization on Intrusion Detection Systems (IDS), and we
provide the first rigorous analysis of the effect of single-field anonymization on IDS effectiveness. This
work begins to answer three questions: whether the choice of field affects utility more than the choice of
anonymization algorithm; which fields have a larger impact on utility; and which anonymization algorithms
have a larger impact on utility.

1 Introduction 

The ability to safely share log files and network traces has become increasingly important to several
communities: networking research, computer security research, incident response, and education [23].
Synthetically generated data is abundant, but has been highly criticized for many uses [14]. Honeynets can
be useful in generating exercises for students and help meet the needs of educators [1], but very few
honeynets have been set up on a scale to generate some of the large, cross-sectional data sets needed by
computer security researchers [25]. Furthermore, using a honeynet does not necessarily release the owner
from all legal responsibility when sharing data [19]. Lastly, nothing but real data will suit the needs of the
incident responder who must share data about specific attacks on real machines that are under investigation.
Therefore, there is a high demand within several communities for methods to share real log files and
network traces.

At the same time as there is an increased need to share these data sets, there is increased reluctance.
First, many data owners recognize the inherent security risks of releasing detailed information that could be
used to map out their own networks, services or sensor locations. Secondly, companies have serious privacy
concerns about releasing customer data, especially in light of recent incidents where some publicly
released data, believed to be anonymous, leaked information about specific users [7] [15]. Lastly, recent
research has questioned the legality of releasing much of the data as has been done up until now.

The burgeoning field of data sanitization has helped address this tension by providing organizations that
wish to share their data with new tools to anonymize computer and network logs. However, little has been
done to help users negotiate the difficult trade-off between the data owner's need for security and privacy
and the data analyst's need for high quality data, which is called the utility vs. security trade-off [20].
As anonymization is an inherently lossy process, and the data analyst wants information as close to the
original as possible, there is always this tension and a need to negotiate policies
that meet the needs of both parties.

Necessary to solving this problem of negotiating policies for data sanitization is the ability to analyze the 
effects of anonymization on both the security of the sanitized data and the utility left after anonymization. 
In this paper, we focus upon the latter problem of evaluating the effects of anonymization on the utility 
of the data sets to be shared. Of course, utility is subjective since it depends upon who is using the data, 
or more specifically, for what purpose it is being used. Hence, what is important to a researcher in the 
network measurements community may be completely irrelevant to the incident responder. Therefore, we 
have scoped this work to evaluating utility for attack detection. 

The task of attack detection is an important part of the incident responder's daily job. When investigating 
broad attacks, of which their organization is only a part, they may have to settle for anonymized logs from the 
other sites involved. It would be a similar case when using a distributed or collaborative intrusion detection 
system that crosses organizational boundaries. Output from the sensors may need to be anonymized. Not 
only is attack detection important in these "real world" applications, but it is important to the intrusion 
detection research community, as well. This community has often complained that their only good data sets 
to test new technologies against are synthetic. However, if they can still do their analysis with anonymized 
data, then they are more likely to obtain large, real data sets. 

The utility of a data set is constrained not only by the type of analysis being done with it but also by
the type of data being shared. We have chosen to look at the effects of anonymization on one of the most
commonly shared and general types of data, the pcap formatted network trace. From this type of data,
many others can be derived (e.g., NetFlows).

To quantitatively measure the ability to identify attacks in anonymized data, we developed what we call
the IDS Utility Metric. This measurement evaluates and compares the false positive and false negative rates
of a baseline unanonymized data set with those of an anonymized data set. By doing this, we can automate an
objective process to help us answer several questions: (1) how does anonymization of a particular field
affect the ability to detect an attack, (2) are there unique effects when certain pairs or triplets of fields
are anonymized together, and (3) how does the use of different types of anonymization algorithms affect
attack identification? In this paper, we present the results of anonymizing a portion of the 1999
MIT/Lincoln Labs DARPA data set, using the FLAIM [50] anonymization framework, with over 150 separate
anonymization policies to help us answer these questions.

The rest of the paper is organized as follows. Section 2 describes the anonymization algorithms used
by FLAIM in our experiments, while Section 3 describes the methodology and setup of our experiments.
In Section 4 we present our results and analysis. We survey the related work in Section 6 and state our
conclusions, as well as scope out future work to be done, in Sections 7 and 8.

2 FLAIM: Framework for Log Anonymization and Information Management

We chose to use FLAIM [50] as our anonymization engine for several reasons. First, we could easily script
its execution for a multitude of tests. Second, it has a very flexible XML policy language that made it
simple to generate hundreds of unique anonymization policies. Third, it can anonymize at least as many
fields in pcap traces as any other anonymization tool. Lastly, FLAIM has a very rich set of anonymization
algorithms that can be applied to all these fields. With these properties, it was the ideal tool for our
experiments.

Figure 1: This diagram illustrates the process by which we compare anonymized logs versus non-anonymized
logs (diagram output labels: alerts generated; false positive and false negative counts).

2.1 Anonymization Algorithms 

FLAIM implements a plethora of anonymization algorithms for several data types. The three basic data
types available to most anonymization algorithms are binary, string, and numeric. Additionally, there are
a few special data types, such as timestamps. Binary data is treated as a string of bits with no special
structure. Algorithms anonymizing binary data output binary data of the same length. String data is
variable length and terminated by a null character. Anonymization algorithms that take in strings also
output strings, although the length may change. Numeric data is interpreted as a number in a given base,
which is specified in the anonymization policy. This is useful, for example, when working with decimal
numbers like a port number. There, one may want to act on individual digits rather than bits (e.g.,
replacing the last 3 digits with 0's).
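
The digit-level behavior described for numeric data can be illustrated with a short sketch. This is not FLAIM's implementation; the function name `truncate_digits` and its parameters are hypothetical and chosen only to show what "replacing the last 3 digits with 0's" means for a value interpreted in a given base.

```python
def truncate_digits(value: int, num_digits: int, base: int = 10) -> int:
    """Replace the last `num_digits` digits of `value`, written in `base`, with zeros.

    Illustrative only: the name and signature are assumptions, not FLAIM's API.
    """
    if value < 0 or num_digits < 0 or base < 2:
        raise ValueError("value and num_digits must be non-negative and base >= 2")
    modulus = base ** num_digits
    # Subtracting the remainder zeroes out the low-order digits in this base.
    return value - (value % modulus)


# Decimal port number: the last 3 digits become 0.
print(truncate_digits(8080, 3))                 # 8000
# The same operation in base 2 acts on bits instead of decimal digits.
print(truncate_digits(0b10110111, 4, base=2))   # 176, i.e. 0b10110000
```

Note that any port below 1000 maps to 0 under a 3-digit decimal truncation, which is relevant to the port-0 alerts discussed in the results.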

Table 10 lists the different anonymization algorithms in FLAIM along with the data types they operate
upon. Further information on these algorithms can be found in the FLAIM manual.

2.2 Anonymization Policies in FLAIM 

FLAIM provides an expressive and powerful method for specifying anonymization policies that can be
modified at run time, thus enabling efficient automation. An anonymization policy is an XML file that
specifies the anonymization algorithms that should be applied to the various fields in the log, along with
any special parameters to be passed to those algorithms.



3 Methodology 

For all of our experiments, we used a subset of the 1999 DARPA evaluation data set. The Defense Advanced
Research Projects Agency (DARPA) created an Intrusion Detection Evaluation testbed in 1998 and 1999.
Data was captured from a simulated network that was subjected to various attacks. This data set has
been frequently used in evaluating intrusion detection systems since its creation [11]. Thus, we found it
appropriate to use in evaluating the effects that anonymization of data can have on intrusion detection.

As mentioned, we used only a portion of the 1999 data set. Specifically, we used the inside tcpdump data
from Wednesday of the second week of the evaluation. Since FLAIM currently anonymizes only TCP, UDP
and ICMP, we filtered out all other network protocols before running our experiments, whose methodology
is depicted in Figure 1.

The IDS Utility Metric is constructed as follows. First, the unanonymized data set is processed by Snort
to produce the "baseline" set of alerts, using the default Snort rule sets. We consider this the ideal set
of alerts for the data set. Next, we take the same data set and anonymize it, in our case with FLAIM.
We then run Snort against the anonymized data set with the same rule set as before. The
alerts generated from the unanonymized file are used as a baseline against which the alerts generated by
the anonymized file are compared. The difference between the alerts from the anonymized file and those from
the unanonymized file is used as a measure of the loss of utility in the log. The larger the difference, the
more of the information that the IDS needs to correctly identify an attack is missing from the log. This
process is depicted in Figure 1.

FLAIM schemas specify a set of anonymization algorithms that are appropriate for each field in a pcap log.
These are summarized in Figure 4. In this evaluation, we only consider anonymization policies that transform
single fields. Before we can evaluate the effect of multi-field policies, we must come to an understanding of
single field policies. There are 152 single field policies that can be generated for pcap data (Figure 4 in the
Appendix lists all the fields and anonymization algorithms used for testing). Each anonymization algorithm
also has parameters that affect how it anonymizes the field. We will not go over the parameters in detail,
but instead refer readers to the FLAIM manual. Table 9 in the Appendix summarizes the parameter
settings for the anonymization algorithms.

We iterate the process described above over each of the 152 anonymization policies, comparing the results
pre- and post-anonymization. We describe how we compare these data sets in the next section, while the
section after it discusses the actual metric in more detail.

3.1 Comparing Snort Alerts 

Alerts generated by Snort are defined by several properties (a full list is included in the Appendix as
Table 8). Each alert is associated with a specific packet. The alert fields relevant for our purposes are
listed below:

timestamp : the timestamp of the offending packet.

sig_generator : the part of Snort that generated the alert.

sig_id : the ID number of the signature that fired.

msg : the description of the alert.

proto : the protocol of the offending packet.

src : the source IP address of the offending packet.

srcport : the source port of the offending packet.

dst : the destination IP address of the offending packet.

dstport : the destination port of the offending packet.

id : the ID of the offending packet.

To determine whether two alert sets are equal, we need a way of determining whether two alerts are equal.
Normally this can be done by comparing each field of the alerts. However, in this case anonymization
changes some field values, so alerts that are actually equal would appear unequal. To overcome this, we
compare alerts only on fields that will not change due to anonymization. This leads to two distinct field
sets that must be used when comparing alerts. They are shown in Table 1. Field Set 1 is used when the
timestamp field has not been anonymized. Field Set 2 is used when the timestamp field has been anonymized.

Field Set 1    Field Set 2
timestamp      sig_id
sig_id         src
id             srcport
               dst
               dstport
               id
               tcpseq

Table 1: The two field sets that are used to compare alerts.
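
The comparison rule can be sketched as projecting each alert onto whichever field set applies and comparing the resulting tuples. This is a minimal illustration, assuming alerts are available as dictionaries keyed by the field names above; the function name `alert_key` and the sample values are hypothetical.

```python
from typing import Any, Dict, Tuple

# Field sets from Table 1.
FIELD_SET_1 = ("timestamp", "sig_id", "id")
FIELD_SET_2 = ("sig_id", "src", "srcport", "dst", "dstport", "id", "tcpseq")


def alert_key(alert: Dict[str, Any], timestamp_anonymized: bool) -> Tuple[Any, ...]:
    """Project an alert onto the fields that are used for comparison.

    Field Set 2 is used when the timestamp was anonymized, Field Set 1 otherwise.
    Missing fields are represented as None.
    """
    fields = FIELD_SET_2 if timestamp_anonymized else FIELD_SET_1
    return tuple(alert.get(name) for name in fields)


# Hypothetical alerts: same packet and signature, but the source address was anonymized.
baseline_alert = {"timestamp": "08:00:01.000000", "sig_id": 1201, "id": 4242,
                  "src": "172.16.112.50"}
anonymized_alert = dict(baseline_alert, src="10.0.0.1")

# Under Field Set 1 the two alerts compare equal because src is ignored.
print(alert_key(baseline_alert, False) == alert_key(anonymized_alert, False))  # True
```

Two alerts are then treated as equal exactly when their keys are equal, which makes the alert sets suitable for ordinary set operations.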



3.2 Metrics for evaluating utility 

The purpose of anonymization is to share logs while hiding sensitive information. Because anonymization
is inherently an information-reducing procedure, it must be measured in terms of the amount of information
that is lost from the file. In addition, anonymizing can introduce new false patterns into the data. The
"best" anonymization policy should minimize information loss, while not adding any new false patterns to
the log.

We can consider the IDS process as a pattern classification process. The data set is input, and the IDS
classifies each packet as malicious or not. The alerts generated from the unanonymized data (which we call
the baseline data) are considered to be the correct analysis of the data set. We then compare the alerts
generated by the anonymized data to the baseline data alert set.

Let the set of alerts generated from the baseline file be called $A_{baseline}$ and the set of alerts
generated from the anonymized file be called $A_{anony}$. In terms of alerts, we compare $A_{anony}$ against
$A_{baseline}$. The best result would be for the anonymized alerts to match, exactly, the alerts generated
in the baseline set. Let us consider the baseline alerts to be the target set. The alerts from the
anonymized file will be the generated set. Then we can define several metrics:

True Positive $TP = |A_{anony} \cap A_{baseline}|$: the number of alerts in $A_{anony}$ that are also in
$A_{baseline}$.

False Positive $FP = |A_{anony} \setminus (A_{anony} \cap A_{baseline})|$: the number of alerts that were
generated by the anonymized file, but were not in the baseline file.

False Negative $FN = |A_{baseline} \setminus (A_{anony} \cap A_{baseline})|$: the number of baseline alerts
that were not generated by the anonymized file.

The True Positive rate indicates how much of the information was preserved in the anonymized log. The
False Positive rate indicates how many additional patterns were added to the log through anonymization. The
False Negative rate indicates the amount of information that was removed from the log. A good anonymization
policy should keep both the False Positive and False Negative rates low while maximizing the True Positive
rate.

While the false positive rate is an important factor, the false negative rate is of primary importance, as
it indicates the loss of information through anonymization. For the remainder of this paper, we use both
False Positive and False Negative rates as a measure of the utility of a log post-anonymization, but focus
more on the false negative rate.
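
The three counts follow directly from set operations on the two alert sets. The sketch below is only an illustration of the definitions, assuming each alert has already been reduced to a hashable comparison key that is unique per alert; the function name `ids_utility_counts` and the toy alert keys are hypothetical.

```python
from typing import Dict, Hashable, Set


def ids_utility_counts(baseline: Set[Hashable], anonymized: Set[Hashable]) -> Dict[str, int]:
    """Return TP, FP and FN counts for an anonymized alert set against a baseline."""
    matched = anonymized & baseline
    return {
        "TP": len(matched),               # alerts present in both sets
        "FP": len(anonymized - matched),  # alerts that appear only after anonymization
        "FN": len(baseline - matched),    # baseline alerts lost through anonymization
    }


# Toy alert keys of the form (sig_id, packet id); the values are illustrative only.
baseline_alerts = {(1201, 10), (1201, 11), (1292, 12), (503, 13)}
anonymized_alerts = {(1201, 10), (1292, 12), (524, 14)}

counts = ids_utility_counts(baseline_alerts, anonymized_alerts)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 2}
```

Two identities follow from the definitions and are useful as sanity checks: TP + FN equals the number of baseline alerts, and TP + FP equals the number of alerts from the anonymized file.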

4 Results and Analysis 

In this section, we describe the results of our experiments with single field anonymization policies. These 
experiments provide a substantive start to answering these questions: 

• What affects utility more, the fields that are anonymized or the anonymization algorithm?

• Which fields have a larger impact on the utility of a log?

• Which anonymization algorithms have a larger impact on the utility of a log?

To answer these questions, we evaluated all 152 pairs of fields and anonymization algorithms. For each
pair, the number of alerts generated by Snort was calculated. The alerts generated for each pair were
compared with the baseline alerts (see Table 2 and Table 3). False positives and false negatives were
calculated based on the definitions above; we discuss the detailed results later.

The unanonymized file produced a total of 81 alerts, but only 19 unique types of alerts. The number and
types of alerts produced are summarized in Table 4.

4.1 Fields or Anonymization Algorithms? 

It is important to understand which has the greater impact on utility: the field that is being anonymized
or the anonymization algorithm that is being applied.

To evaluate the effects of anonymizing a field, we calculate the marginal of a field. The marginal of a
field is the number of false positives (or false negatives) averaged over all anonymization algorithms
applied to that field. Similarly, the marginal of an anonymization algorithm is the number of false
positives (or false negatives) averaged over all fields to which it was applied. The
marginal provides a concise summary of the effect of anonymizing a particular field or using a particular
anonymization algorithm.
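
A marginal is simply a grouped average over per-policy counts. The sketch below shows one way to compute it; the function name `marginals` and the data layout are assumptions made for illustration. The false negative counts in the example are derived from the alert totals in Table 3 (81 baseline alerts minus the alerts generated), under the assumption that these policies produced no false positives, as Table 6 reports for TCP_SEQUENCE and TCP_FLAGS.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (algorithm, field) -> (false negatives, false positives)
PolicyResults = Dict[Tuple[str, str], Tuple[int, int]]


def marginals(results: PolicyResults, axis: str) -> Dict[str, Tuple[float, float]]:
    """Average (FN, FP) over all policies sharing the same algorithm or field."""
    if axis not in ("algorithm", "field"):
        raise ValueError("axis must be 'algorithm' or 'field'")
    position = 0 if axis == "algorithm" else 1
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for policy, counts in results.items():
        grouped[policy[position]].append(counts)
    return {
        name: (mean(fn for fn, _ in rows), mean(fp for _, fp in rows))
        for name, rows in grouped.items()
    }


results: PolicyResults = {
    ("BinaryBlackMarker", "TCP_SEQUENCE"): (0, 0),   # 81 alerts
    ("NumericTruncation", "TCP_SEQUENCE"): (16, 0),  # 65 alerts
    ("Annihilation", "TCP_SEQUENCE"): (16, 0),       # 65 alerts
    ("Classify", "TCP_SEQUENCE"): (0, 0),            # 81 alerts
    ("BinaryBlackMarker", "TCP_FLAGS"): (76, 0),     # 5 alerts
    ("NumericTruncation", "TCP_FLAGS"): (76, 0),     # 5 alerts
    ("Annihilation", "TCP_FLAGS"): (76, 0),          # 5 alerts
}

field_marginals = marginals(results, "field")
print(field_marginals["TCP_SEQUENCE"])  # false negative marginal of 8, as in Table 6
print(field_marginals["TCP_FLAGS"])     # false negative marginal of 76, as in Table 6
```

Calling `marginals(results, "algorithm")` on this partial data would average over only two fields, so it would not reproduce the algorithm marginals reported for the full experiment.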

Figure 2 shows the false positive and false negative marginals of the fields. The full data for these graphs
is in Table 6. Figure 3 shows the false positive and false negative marginals of the anonymization
algorithms. The full data for these graphs is in Table 5.

Figure 2: The left-hand chart (Field vs. Avg. False Positive Rate) shows the marginal of each field with
respect to false positives (i.e., the average number of false positives for a field, averaged over all
anonymization algorithms). The right-hand chart shows the marginal of each field with respect to false
negatives.

The fact that the majority of fields have no impact on utility, even when anonymized in numerous ways,
indicates that the choice of field is more important than the choice of anonymization algorithm in
determining the utility of an anonymized log. If it were the other way around, we would expect false
positives and false negatives to be more evenly distributed over all the fields.

Figure 2 indicates that the majority of fields generate no false positives or false negatives. The fields
for which this is true (such as DST_MAC) did not affect the utility of the log under any anonymization
algorithm. We can conclude that anonymizing these fields has no impact on the utility of a log, with respect
to the IDS metric.

Consider the false positive rate first. We can see that very few fields generated any false positives. This
indicates that anonymization usually does not add new patterns.

Table 2: Alerts generated for anonymization algorithm and field pairs. AnonyAlg is the anonymization
algorithm used; Field is the field to which it was applied; NumAlerts is the number of alerts that were
generated; NumTypesOfAlerts is the number of different types of alerts generated. Part 1 of 2.

AnonyAlg              Field          NumAlerts  NumTypesOfAlerts
BinaryBlackMarker     SRC_MAC        81         19
BytesTruncation       SRC_MAC        81         19
Annihilation          SRC_MAC        81         19
MacRandomPermutation  SRC_MAC        81         19
BinaryBlackMarker     DST_MAC        81         19
BytesTruncation       DST_MAC        81         19
Annihilation          DST_MAC        81         19
MacRandomPermutation  DST_MAC        81         19
IPv4PrefixPreserving  IPV4_SRC_IP    1014       5
BinaryBlackMarker     IPV4_SRC_IP    9          4
Annihilation          IPV4_SRC_IP    9          4
RandomPermutation     IPV4_SRC_IP    9          4
NumericTruncation     IPV4_SRC_IP    9          4
IPv4PrefixPreserving  IPV4_DST_IP    759        5
BinaryBlackMarker     IPV4_DST_IP    9          4
Annihilation          IPV4_DST_IP    9          4
RandomPermutation     IPV4_DST_IP    14         5
NumericTruncation     IPV4_DST_IP    9          4
BinaryBlackMarker     IPV4_ID        81         19
Annihilation          IPV4_ID        81         19
NumericTruncation     IPV4_ID        81         19
RandomPermutation     IPV4_ID        81         19
Classify              IPV4_ID        81         19
Annihilation          IPV4_OFFSET    81         19
BinaryBlackMarker     IPV4_TTL       81         19
Annihilation          IPV4_TTL       81         19
NumericTruncation     IPV4_TTL       81         19
RandomPermutation     IPV4_TTL       81         19
Classify              IPV4_TTL       81         19
Annihilation          IPV4_CHECKSUM  81         19
BinaryBlackMarker     TCP_DST_PORT   520290     2
NumericTruncation     TCP_DST_PORT   457410     6
Substitution          TCP_DST_PORT   923033     2
Annihilation          TCP_DST_PORT   923033     2
RandomPermutation     TCP_DST_PORT   18         2
Classify              TCP_DST_PORT   521670     2
BinaryBlackMarker     TCP_SRC_PORT   380527     4
NumericTruncation     TCP_SRC_PORT   294519     6

Table 3: Alerts generated for anonymization algorithm and field pairs. AnonyAlg is the anonymization
algorithm used; Field is the field to which it was applied; NumAlerts is the number of alerts that were
generated; NumTypesOfAlerts is the number of different types of alerts generated. Part 2 of 2.

AnonyAlg              Field          NumAlerts  NumTypesOfAlerts
Substitution          TCP_SRC_PORT   922171     6
Annihilation          TCP_SRC_PORT   922171     6
RandomPermutation     TCP_SRC_PORT   20         4
Classify              TCP_SRC_PORT   381954     4
BinaryBlackMarker     TCP_SEQUENCE   81         19
NumericTruncation     TCP_SEQUENCE   65         15
Annihilation          TCP_SEQUENCE   65         15
Classify              TCP_SEQUENCE   81         19
BinaryBlackMarker     TCP_ACK_NO     62         16
NumericTruncation     TCP_ACK_NO     56         14
Annihilation          TCP_ACK_NO     56         14
Classify              TCP_ACK_NO     81         19
BinaryBlackMarker     TCP_FLAGS      5          3
NumericTruncation     TCP_FLAGS      5          3
Annihilation          TCP_FLAGS      5          3
BinaryBlackMarker     TCP_WINDOW     81         19
NumericTruncation     TCP_WINDOW     81         19
Annihilation          TCP_WINDOW     81         19
Classify              TCP_WINDOW     81         19
Annihilation          TCP_CHECKSUM   81         19
BinaryBlackMarker     TCP_URGENT     81         19
NumericTruncation     TCP_URGENT     81         19
Annihilation          TCP_URGENT     81         19
Classify              TCP_URGENT     81         19
Annihilation          TCP_OPTIONS    81         19
BinaryBlackMarker     UDP_DST_PORT   81         19
NumericTruncation     UDP_DST_PORT   81         19
Substitution          UDP_DST_PORT   81         19
Annihilation          UDP_DST_PORT   81         19
RandomPermutation     UDP_DST_PORT   81         19
Classify              UDP_DST_PORT   81         19
BinaryBlackMarker     UDP_SRC_PORT   81         19
NumericTruncation     UDP_SRC_PORT   81         19
Substitution          UDP_SRC_PORT   81         19
Annihilation          UDP_SRC_PORT   81         19
RandomPermutation     UDP_SRC_PORT   81         19
Classify              UDP_SRC_PORT   81         19
Annihilation          UDP_CHECKSUM   81         19
RandomTimeShift       TS_SEC         81         19
TimeUnitAnnihilation  TS_SEC         81         19
Annihilation          TS_SEC         81         19
BinaryBlackMarker     TS_SEC         83         20
TimeEnumeration       TS_SEC         399        19
Annihilation          TS_USEC        81         19
Table 4: Alerts generated in the baseline unanonymized file. AlertID is the ID of the alert; Num is the
number of alerts of that type generated; Desc is a description of the alert.

AlertID  Num  Desc
1        1    (portscan) TCP Portscan
323      1    FINGER root query
330      1    FINGER redirection attempt
332      1    FINGER query
356      1    FTP passwd retrieval attempt
359      1    FTP satan scan
503      4    MISC Source Port 20 to <1024
1200     9    ATTACK-RESPONSES Invalid URL
1201     12   ATTACK-RESPONSES 403 Forbidden
1288     4    WEB-FRONTPAGE /_vti_bin/ access
1292     30   ATTACK-RESPONSES directory listing
1418     2    SNMP request tcp
1420     2    SNMP trap tcp
1421     1    SNMP AgentX/tcp request
2467     1    NETBIOS SMB D$ Unicode share access
2470     1    NETBIOS SMB C$ Unicode share access
2473     1    NETBIOS SMB ADMIN$ Unicode share access
3151     6    FINGER / execution attempt
3441     2    FTP PORT bounce attempt



Figure 3: The left-hand chart (Anonymization Algorithm vs. Avg. False Positive Rate) shows the marginal of
all the anonymization algorithms with respect to false positives. The right-hand chart (Anonymization
Algorithm vs. Avg. False Negative Rate) shows the marginal of all anonymization algorithms with respect to
false negatives.

4.2 Which fields have higher impact on utility?

As Figure 2 and Table 6 clearly show, the majority of fields do not generate false positives. The fields that
do are TCP_DST_PORT, TCP_SRC_PORT, IPV4_SRC_IP, IPV4_DST_IP, TS_SEC, and TCP_ACK_NO.

We can see that several of the fields resulted in an average of 81 alerts, the same number as the baseline.
These fields, which show no false positives or false negatives in Table 6, had no error. Judging from these
results, we can see that this set of fields did not affect the utility of the log as measured by the IDS
metric.

TCP_DST_PORT and TCP_SRC_PORT generated the most false alerts on average. Upon inspection of
the generated alert files, we can see that most of the pairs generated only 2 types of alerts. In the case of
BinaryBlackMarker there were 520286 alerts of type 524. Alert 524 is "BAD-TRAFFIC tcp port 0 traffic".
The other anonymization algorithms produced the same pattern: the majority of alerts were of type 524.

The reason for this is the value we substitute for the port field. The BinaryBlackMarker, Substitution,
NumericTruncation, and Annihilation algorithms all replaced the port field with 0. This resulted in the
alert being triggered for nearly all the packets in the log (see Table 9 for the parameter settings of the
anonymization algorithms). The same problem occurs for the TCP_SRC_PORT field.

The RandomPermutation algorithm replaced the port number with another, random port number, and thus only
infrequently caused the generation of the BAD-TRAFFIC alert. It is clear from this that the false positive
rate was greatly affected by the choice of the substitution port. However, the effect would have been less
pronounced had we counted the types of new alerts rather than the raw alerts themselves.
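
The difference between counting raw alerts and counting new alert types can be made concrete with a small sketch. It is an illustration only: the function name `new_type_summary` is hypothetical, the input is toy data rather than the experimental alert files, and it counts alerts whose signature ID never appears in the baseline, which is a simplification of the false positive definition given earlier.

```python
from collections import Counter
from typing import Iterable, Tuple


def new_type_summary(baseline_sig_ids: Iterable[int],
                     anonymized_sig_ids: Iterable[int]) -> Tuple[int, int]:
    """Return (raw alert count, distinct type count) for alert types absent from the baseline."""
    baseline_types = set(baseline_sig_ids)
    new_alerts = Counter(sid for sid in anonymized_sig_ids if sid not in baseline_types)
    return sum(new_alerts.values()), len(new_alerts)


# Toy data: one new signature (524) fires on many packets after the port is set to 0.
baseline = [503, 1201, 1201, 1292]
anonymized = [524] * 1000 + [1201]

raw_count, type_count = new_type_summary(baseline, anonymized)
print(raw_count, type_count)  # 1000 1
```

Measured by raw alerts the policy looks far worse than measured by alert types, since a single new signature accounts for every additional alert.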

Arguably, it is the false negative count that is most important in determining the utility of a log. A
low false negative count indicates that little information was lost in the process of anonymization. In
terms of false negatives, we find that there are 8 fields that have an impact on the average false negative
rate: TCP_DST_PORT, TCP_FLAGS, TCP_SRC_PORT, IPV4_DST_IP, IPV4_SRC_IP, TCP_ACK_NO,
TCP_SEQUENCE, and TS_SEC.

4.3 What anonymization algorithms have higher impact on utility? 

Figure 3 summarizes the false positive and false negative rates with respect to anonymization algorithm. For
each anonymization algorithm, the average false positive/false negative rate is calculated over all the fields.
Table 5 contains the data for the graphs.

We can see from these that most anonymization algorithms have an impact on the utility of a log. In 
contrast, the field data that we saw before showed strong structure in what fields affected utility. When 
viewed from the perspective of anonymization algorithms, there is no single anonymization algorithm that 
stands out. 

It might seem like the substitution algorithm has the largest effect with a huge number of alerts. However, 
this is because of the parameter setting used. By looking at the false negative rate we can see that while 
substitution still has a large effect on utility, most of the other algorithms have an effect as well. 

5 Multi-Field Policy Analysis 

By multi-field anonymization policies, we are referring to anonymization schemes that transform two or more
fields in a log. The bulk of this work has focused on single field policy analysis. However, single field policy
analysis will lay the groundwork for understanding more complex multi-field policies.

Multi-field policies are difficult to analyze because the fields are often related in subtle ways, not just in
very direct ways such as the relationship between the ACK and SEQ numbers. The anonymization of additional
fields certainly does not affect the total number of false positive alerts in a linear way. For example,
examine Table 7.

Here, we see that anonymizing 3 fields separately produces 1014, 7 and 399 false alerts, respectively. If 
we anonymize the first 2 fields, we get 1016 7^ 1014-1-7 alerts. This may not be too surprising. However, if 
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Table 5: False negative/positive counts for an anonymization algorithm, aggregated over all log fields. 
Anonymization Alg. False Negatives False Positive 



Annihilation 


10.62 


47312.69 


B inary B lackMar ker 


13.43 


30027.37 


F? V t PS TV n n r 1 1 o n 








Classify 


8.5 


50200.83 


IPv4PrefixPreserving 


28.8 


351 


MacRandomPermutation 








Numeric Truncation 


17.96 


32692.13 


RandomPermut at ion 


16.72 


2.11 


RandomTimeShift 








Substitution 


38 


461298.5 


TimeEnumeration 


1.25 


80.75 


TimeUnit Annihilation 









Table 6: False negative/positive counts for a field, aggregated over all anonymization algorithms. 

Field             False Negatives   False Positives
DST_MAC
IPV4_CHECKSUM
IPV4_DST_IP       72                151
IPV4_ID
IPV4_OFFSET
IPV4_SRC_IP       72                201
IPV4_TTL
SRC_MAC
TCP_ACK_NO        20                2.75
TCP_CHECKSUM
TCP_DST_PORT      77.67             557572.33
TCP_FLAGS         76
TCP_OPTIONS
TCP_SEQUENCE      8
TCP_SRC_PORT      75.5              483554.83
TCP_URGENT
TCP_WINDOW
TS_SEC            1.2               65.2
TS_USEC
UDP_CHECKSUM
UDP_DST_PORT
UDP_SRC_PORT

Table 7: This table shows that a multi-field policy has uncertain effects on the utility of a log. 

Policy                                  Alerts
IPV4_SRC_IP-IPv4PrefixPreserving        1014
TCP_SRC_PORT-RandomPermutation          7
TS_SEC-TimeEnumeration                  399
IPV4_SRC_IP, TCP_SRC_PORT               1016
IPV4_SRC_IP, TCP_SRC_PORT and TS_SEC    1010

we do all 3 fields together, we get only 1010 alerts, rather than something around 1420 = 1014 + 7 + 399. 
This is actually fewer false alerts than anonymizing IPV4_SRC_IP alone produces (1014). 

So while there is always more information loss when anonymizing more fields, there may be a critical 
point at which one starts getting fewer false positives as additional fields are anonymized. We suspect 
that such a point always exists, since complete anonymization will leave nothing to alert upon. Clearly, 
more analysis must be done on multi-field anonymization, and our future work will focus on evaluating 
multi-field anonymization policies. 
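
The arithmetic behind this observation can be checked with a short sketch. The alert counts are copied from Table 7; the function names and the idea of an "additive estimate" are illustrative assumptions, not part of our tooling.

```python
# Alert counts copied from Table 7 (single-field policies).
single_field_alerts = {
    "IPV4_SRC_IP": 1014,   # IPv4PrefixPreserving
    "TCP_SRC_PORT": 7,     # RandomPermutation
    "TS_SEC": 399,         # TimeEnumeration
}

# Alert counts copied from Table 7 (multi-field policies).
observed_multi_field_alerts = {
    ("IPV4_SRC_IP", "TCP_SRC_PORT"): 1016,
    ("IPV4_SRC_IP", "TCP_SRC_PORT", "TS_SEC"): 1010,
}


def additive_estimate(fields):
    """Alert count we would expect if single-field effects simply added up."""
    return sum(single_field_alerts[field] for field in fields)


def additivity_gap(fields):
    """Observed alerts minus the additive estimate (negative means fewer)."""
    return observed_multi_field_alerts[tuple(fields)] - additive_estimate(fields)


for fields, observed in observed_multi_field_alerts.items():
    print(fields, "additive:", additive_estimate(fields),
          "observed:", observed, "gap:", additivity_gap(fields))
```

For the two-field policy the gap is small (1016 observed against an additive estimate of 1021), while for the three-field policy it is large (1010 against 1420).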

6 Related Work 

Work in data sanitization for computer and network logs to date has focused almost entirely upon the 
development of tools [20] and anonymization algorithms, with little to no formal analysis of the effects 
of anonymization. This is an important gap in the research body: without such analysis, we have many tools 
that anonymize data haphazardly, with no guidance on how to do so wisely or effectively. This is also in 
stark contrast to k-anonymity [24], which is usually applied to medical and census data and has received 
much more attention from the research community. 

While writing this paper, a new piece of work closely related to ours appeared [37]. Though it is mostly 
a paper presenting yet another anonymization tool for pcap logs, at the end the authors introduce a similar 
method of analyzing the effects of anonymization of pcap traces by use of an IDS. Theirs is a cursory 
analysis, considering anonymization of only one field at a time and for only a few different fields. Also, 
their analysis considered only the number of alerts, on the premise that the more alerts are generated, the 
more "security analysis" is provided. Clearly, more than just the number of alerts needs to be considered 
when evaluating utility, such as the false positive and false negative rates. We were unable to reproduce 
their results (which may be because they used an unspecified subset of the LBNL data set^), but more 
troubling is the use of this data set in the first place. To evaluate the effects of anonymization, one must 
start with unanonymized data. However, the LBNL data set is already anonymized, so there is no clean 
baseline with which to compare the anonymized data. 
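
To illustrate why raw alert counts are not enough, the following minimal sketch compares the alerts raised on a baseline trace with those raised on its anonymized counterpart. It assumes each alert can be reduced to a hashable key; the `(packet_id, sig_id)` tuples and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter


def alert_differences(baseline_alerts, anonymized_alerts):
    """Return (false_negatives, false_positives) relative to the baseline.

    A false negative is a baseline alert missing after anonymization;
    a false positive is an alert that appears only after anonymization.
    """
    baseline = Counter(baseline_alerts)
    anonymized = Counter(anonymized_alerts)
    # Counter subtraction keeps only positive counts, i.e. a multiset difference.
    lost = baseline - anonymized
    gained = anonymized - baseline
    return sum(lost.values()), sum(gained.values())


# Hypothetical alerts keyed by (packet_id, sig_id).
baseline = [(1, 1000), (2, 1000), (3, 2000)]
anonymized = [(1, 1000), (3, 2000), (3, 2000), (4, 3000)]

false_negatives, false_positives = alert_differences(baseline, anonymized)
print(len(baseline), len(anonymized), false_negatives, false_positives)
```

In this toy example the anonymized trace raises more alerts than the baseline (4 against 3), yet one real alert was lost and two spurious alerts were introduced, which a count-only comparison would hide.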

7 Conclusions 

Anonymization can be a powerful tool to allow greater cooperation between organizations. The need for 
cooperation is strikingly clear. However, a clear understanding of the needs of the data provider and the 
client is necessary before flexible, effective sharing between organizations can occur. 

The objective of this paper has been to begin to formally evaluate the utility vs. security trade-off. The 
IDS metric is simple yet effective in evaluating the difference in utility when anonymizing different fields in 
different ways. 

In this paper, we have focused on answering three questions: whether the field being anonymized affects 
utility more than the anonymization algorithm does; which fields have a larger impact on utility; and which 
anonymization algorithms have a larger impact on utility. 

We have provided a thorough evaluation of single field anonymization policies upon pcap formatted 
network traces. We found that the primary impact on the utility of a log is not the particular anonymization 
algorithm, but rather the field that was anonymized. 

In addition, we were able to empirically show a range of utilities for a log based on the field that was 
anonymized. The loss of utility was largest for ports and IP addresses. There was some loss of utility for the 
fields of ID, sequence number, flags, timestamp, and ACK number. However, for many of the fields there 
was no change in utility when anonymized. 

This empirical evaluation provides the basis for further work on studying the impact of more complex 
anonymization schemes on the utility of a log. 

^http://www.icir.org/enterprise-tracing/ 
8 Future Work 



There are numerous ways in which this work can be extended. First of all, it is clear that evaluating utility 
via Snort-generated IDS alerts causes the utility measure to depend upon the rule set. In fact, it is as yet 
unclear exactly how the rule set impacts the utility measure. We suspect there is a significant effect, 
though, since we found that which field is anonymized has more effect than how it is anonymized, and Snort 
rules usually focus on one or two fields. 

While we used Snort for all of our experiments, it is just one IDS and only one type. It would be very 
interesting to investigate whether anomaly-based IDSs are affected in similar ways. 

The strength of an anonymization algorithm is a measure of how difficult it is to "break" or 
"de-anonymize" a log that has been anonymized via the algorithm. Clearly, we want strong anonymization 
algorithms so that attackers will have difficulty breaking them. As [20] points out, there is 
a trade-off between the security of an anonymization algorithm and the utility of the log. We have not 
discussed the strength of an anonymization algorithm in this work. 

Our work is currently limited to anonymization policies for just one field. Our next step would be to 
extend this work to multiple field anonymization policies and to work with more realistic data. The DARPA 
evaluation data set is useful because it is supervised; however, it is still synthetic. In the future, we will be 
gaining access to other large unanonymized data sets that we can use instead of the DARPA data. Whatever 
data we use, it is important for this research to have unanonymized baseline sets. 

Finally, we have focused on utility for just one task, attack detection. An incident responder does 
more than just detect attacks, and in the future, we could look at how anonymizing logs affects other 
important security-related tasks, such as alert correlation. 
9 Appendix 

Table 8: Fields in a SNORT alert. 

• timestamp 
• sig_generator 
• sig_id 
• sig_rev 
• msg 
• proto 
• src 
• srcport 
• dst 
• dstport 
• ethsrc 
• ethdst 
• ethlen 
• tcpflags 
• tcpseq 
• tcpack 
• tcplen 
• tcpwindow 
• ttl 
• tos 
• id 
• dgmlen 
• iplen 
• icmptype 
• icmpcode 
• icmpid 
• icmpseq 
Fields: SRC_MAC, DST_MAC 
Anonymization: BinaryBlackMarker, BytesTruncation, Annihilation, MacRandomPermutation 

Fields: TCP_SEQUENCE, TCP_ACK_NO, TCP_WINDOWS, TCP_URGENT 
Anonymization: BinaryBlackMarker, NumericTruncation, Annihilation, Classify 

Fields: IPV4_DST_IP, IPV4_SRC_IP, ICMP_IPV4_SRC_IP, ICMP_IPV4_DST_IP 
Anonymization: IPv4PrefixPreserving, BinaryBlackMarker, Annihilation, RandomPermutation, NumericTruncation 

Fields: IPV4_ID, IPV4_TTL, ICMP_TYPE, ICMP_CODE, ICMP_IDENTIFIER, ICMP_SEQUENCE, ICMP_POINTER, 
        ICMP_IPV4_ID, ICMP_IPV4_TTL 
Anonymization: BinaryBlackMarker, NumericTruncation, Annihilation, RandomPermutation, Classify 

Fields: TCP_FLAGS 
Anonymization: BinaryBlackMarker, NumericTruncation, Annihilation 

Fields: ICMP_GATEWAY 
Anonymization: IPv4PrefixPreserving, BinaryBlackMarker, Annihilation, RandomPermutation, 
               NumericTruncation, Classify 

Fields: ICMP_ORIG_DATA 
Anonymization: BinaryBlackMarker, Annihilation 

Fields: TCP_DST_PORT, TCP_SRC_PORT, UDP_DST_PORT, UDP_SRC_PORT 
Anonymization: Substitution, BinaryBlackMarker, Annihilation, RandomPermutation, NumericTruncation, 
               Classify 

Fields: ICMP_TS_ORIG, ICMP_TS_REC, ICMP_TS_TRANS, TS_SEC 
Anonymization: RandomTimeShift, BinaryBlackMarker, Annihilation, TimeUnitAnnihilation, TimeEnumeration 

Fields: IPV4_OFFSET, IPV4_CHECKSUM, TCP_CHECKSUM, TCP_OPTIONS, UDP_CHECKSUM, ICMP_CHECKSUM, 
        ICMP_IPV4_OFFSET, ICMP_IPV4_CHECKSUM, TS_USEC 
Anonymization: Annihilation 

Fields: IPV4_LENGTH, TCP_OFFSET, UDP_LENGTH, ICMP_IPV4_LENGTH 
Anonymization: NONE 

Figure 4: PCAP Fields and Anonymization Algorithms. Each section lists a group of fields together with the 
anonymization algorithms that can be applied to any field in that group. For instance, only the anonymization 
algorithms BinaryBlackMarker and Annihilation can be applied to the ICMP_ORIG_DATA field. 
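
The field-to-algorithm pairing in Figure 4 can be read as a lookup table for validating single field policies. The sketch below encodes only a few of the groups from the figure; the dictionary layout and the names `algorithms_for` and `is_valid_policy` are hypothetical and do not describe how our tool represents policies.

```python
# A subset of the field groups in Figure 4, each mapped to its permitted algorithms.
ALLOWED_ALGORITHMS = {
    ("SRC_MAC", "DST_MAC"): {
        "BinaryBlackMarker", "BytesTruncation", "Annihilation", "MacRandomPermutation",
    },
    ("TCP_FLAGS",): {"BinaryBlackMarker", "NumericTruncation", "Annihilation"},
    ("ICMP_ORIG_DATA",): {"BinaryBlackMarker", "Annihilation"},
    # Fields listed with "NONE" in Figure 4 accept no anonymization algorithm.
    ("IPV4_LENGTH", "TCP_OFFSET", "UDP_LENGTH", "ICMP_IPV4_LENGTH"): set(),
}


def algorithms_for(field):
    """Return the set of algorithms that may be applied to a field."""
    for fields, algorithms in ALLOWED_ALGORITHMS.items():
        if field in fields:
            return algorithms
    raise KeyError(f"unknown field: {field}")


def is_valid_policy(field, algorithm):
    """Check whether a single field policy (field, algorithm) is permitted."""
    return algorithm in algorithms_for(field)


print(is_valid_policy("ICMP_ORIG_DATA", "Annihilation"))
print(is_valid_policy("ICMP_ORIG_DATA", "Classify"))
```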
Table 9: Parameters to the Anonymization Algorithms 

Anonymization Algorithm  Parameters                        Description
IPv4PrefixPreserving     Passphrase: foobar                Sets the anonymization passphrase.
MacRandomPermutation     None                              No parameters.
RandomTimeShift          lowerTimeShiftLimit: 16250000     Shift time by a random amount between
                         upperTimeShiftLimit: 31500000     the lower and upper limits.
TimeUnitAnnihilation     timeField: years                  Annihilate the years portion of the timestamp.
NumericTruncation        numShift: 5                       Shorten the field by 5 bits.
                         radix: 2
TimeEnumeration          baseTime:                         Set the time of the oldest record to 0. One
                         intervalSize: 1                   timestep is equal to adding 1 to the timestamp field.
RandomPermutation        None                              No parameters.
Annihilation             None                              No parameters.
Classify                 configString: 1024:0,65536:65535  Set elements less than 1024 to 0, and all
                                                           others to 65535.
BinaryBlackMarker        numMarks: 8                       Mark 8 bits as
                         replacement:
BytesTruncation          numbits: 20                       Remove 20 bits, starting from the left.
                         direction: left
Substitution             substitute:                       Substitute field with 0.
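
As an illustration of how two of these parameter settings might behave, the sketch below interprets the Classify `configString` as a list of `upper_bound:replacement` pairs and NumericTruncation as dropping the `numShift` low-order digits in the given `radix`. Both interpretations are assumptions inferred from the descriptions in Table 9, and the function names are hypothetical.

```python
def parse_classify_config(config_string):
    """Parse 'bound:value,bound:value' into a sorted list of (bound, value) pairs."""
    pairs = []
    for entry in config_string.split(","):
        bound, replacement = entry.split(":")
        pairs.append((int(bound), int(replacement)))
    return sorted(pairs)


def classify(value, config_string="1024:0,65536:65535"):
    """Map a value to the replacement of the first bound it is strictly below."""
    for bound, replacement in parse_classify_config(config_string):
        if value < bound:
            return replacement
    raise ValueError(f"value {value} is not covered by {config_string!r}")


def numeric_truncation(value, num_shift=5, radix=2):
    """Assumed behavior: drop the num_shift low-order digits in the given radix."""
    return value // radix ** num_shift


print(classify(80), classify(8080))          # a privileged and an unprivileged port
print(numeric_truncation(0b1111100000))      # five low-order bits removed
```

With the parameters from Table 9, every port below 1024 collapses to 0 and every other port to 65535, so only the privileged/unprivileged distinction survives.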
Table 10: Anonymization algorithms with applicable data types. 

Anonymization Alg.      Data Type(s)              Description
Prefix-preserving       binary                    Implements prefix-preserving permutation described in [26].
Truncation              binary, string, numeric   Removes suffix or prefix of data by specified number of units.
Hash                    binary, string, hostname  Outputs cryptographic hash of data.
Black Marker            binary, string, hostname  Overwrites specified number of units with specified constant.
Time Unit Annihilation  timestamp                 Annihilates particular time units (e.g., hour and minute units).
Random Time Shift       timestamp                 Randomly shifts timestamps within given window by same amount.
Enumeration             timestamp                 Preserves order, but not distance between elements.
Random Permutation      binary                    Creates random 1-to-1 mapping.
Annihilation            binary, string            Replaces field with NULL value.
Classify                numeric                   Partitions data into multiple non-overlapping subsets.
Substitution            binary, numeric           Replaces all instances with a particular constant value.
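
Two of the descriptions in Table 10 can be made concrete with a minimal sketch: Enumeration, which preserves order but not distance, and Random Permutation, which creates a random 1-to-1 mapping. The implementations below are illustrative assumptions (for example, that equal timestamps map to the same value and that the permutation is over a small integer domain); the function names and the `seed` parameter are hypothetical.

```python
import random


def enumerate_timestamps(timestamps, base_time=0, interval_size=1):
    """Replace each timestamp by its rank, keeping order but discarding distance."""
    distinct_sorted = sorted(set(timestamps))
    rank = {t: base_time + i * interval_size for i, t in enumerate(distinct_sorted)}
    return [rank[t] for t in timestamps]


def random_permutation(values, domain_size, seed=None):
    """Map integers in range(domain_size) through a random 1-to-1 mapping."""
    rng = random.Random(seed)
    mapping = list(range(domain_size))
    rng.shuffle(mapping)  # a shuffled identity list is a bijection on the domain
    return [mapping[v] for v in values]


print(enumerate_timestamps([100, 105, 100, 5000]))
print(random_permutation([3, 3, 7], domain_size=16, seed=1))
```

The enumerated timestamps keep their relative order while the gap between 105 and 5000 disappears; the permuted values stay distinct wherever the inputs were distinct, and equal inputs stay equal.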